Indexing `DataFrame`
In pandas, both Series and DataFrame objects carry an index. The index labels each row and corresponds to axis 0. It can be autogenerated (a default integer range) or set explicitly. This guide covers setting and resetting the index and working with multi-level indices.
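As a small illustration of the difference between an autogenerated and an explicitly set index, the sketch below builds two Series from made-up values. The names and scores are hypothetical and do not come from any dataset used in this guide.

```python
import pandas as pd

# Hypothetical data, used only for illustration
scores = [85, 92, 78]

# No index supplied: pandas generates a default integer range (0, 1, 2)
auto = pd.Series(scores)

# Index supplied explicitly: the labels replace the default positions
labeled = pd.Series(scores, index=["Alice", "Bob", "Carol"])

print(list(auto.index))     # [0, 1, 2]
print(list(labeled.index))  # ['Alice', 'Bob', 'Carol']
print(labeled["Bob"])       # 92
```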
Setting Index
The set_index() function sets one or more columns of a DataFrame as its index. It replaces the current index rather than keeping it as a column, so the old index values are lost in the result. To preserve them, copy the index to a new column before setting the new one. Note that set_index() returns a new DataFrame by default, so assign the result back to keep it.
Example: Setting Index
import pandas as pd
# Load the dataset, using the first column as the index
df = pd.read_csv("datasets/Admission_Predict.csv", index_col=0)
df.head()
# Preserve the serial number into a new column
df['Serial Number'] = df.index
# Set the index to 'Chance of Admit ' (the column name as used here ends with a space)
df = df.set_index('Chance of Admit ')
df.head()
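The same behaviour can be checked without the CSV file. The following is a minimal sketch on a hypothetical three-row frame (the values are made up, and the column name is written without the trailing space). It also shows the append=True parameter of set_index(), which keeps the old index as an additional level instead of discarding it.

```python
import pandas as pd

# Hypothetical toy data standing in for the admissions file
toy = pd.DataFrame(
    {"GRE Score": [337, 324, 316], "Chance of Admit": [0.92, 0.76, 0.72]},
    index=pd.Index([1, 2, 3], name="Serial No."),
)

# Default behaviour: the old index ('Serial No.') is discarded
replaced = toy.set_index("Chance of Admit")

# append=True keeps the old index as the outer level of a multi-level index
appended = toy.set_index("Chance of Admit", append=True)

print(replaced.index.name)         # Chance of Admit
print(list(replaced.columns))      # ['GRE Score']
print(list(appended.index.names))  # ['Serial No.', 'Chance of Admit']
print(list(toy.columns))           # ['GRE Score', 'Chance of Admit'] (original unchanged)
```

Whether to copy the index into a column first or to use append=True depends on whether the old labels should remain part of the index or become ordinary data.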
Resetting Index
The reset_index() function moves the current index back into an ordinary column and replaces it with the default integer index (0, 1, 2, ...).
df = df.reset_index()
df.head()
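A minimal sketch of the two common ways to call reset_index(), again on hypothetical values: the default call keeps the old index as a column, while drop=True discards it.

```python
import pandas as pd

# Hypothetical toy frame indexed by a made-up 'Chance of Admit' value
toy = pd.DataFrame(
    {"GRE Score": [337, 324, 316]},
    index=pd.Index([0.92, 0.76, 0.72], name="Chance of Admit"),
)

# The index becomes an ordinary column; a fresh integer index takes its place
restored = toy.reset_index()

# drop=True discards the old index instead of keeping it as a column
dropped = toy.reset_index(drop=True)

print(list(restored.columns))  # ['Chance of Admit', 'GRE Score']
print(list(restored.index))    # [0, 1, 2]
print(list(dropped.columns))   # ['GRE Score']
```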
Multi-Level Indexing
pandas supports multi-level indexing, similar to composite keys in relational databases. Passing a list of columns to set_index() creates a hierarchical index with one level per column.
Example: Multi-Level Indexing with Census Data
# Importing census data
df = pd.read_csv('datasets/census.csv')
df.head()
# Filtering to keep only county-level data
df = df[df['SUMLEV'] == 50]
# Reducing columns for simplicity
columns_to_keep = ['STNAME', 'CTYNAME', 'BIRTHS2010', 'BIRTHS2011', 'BIRTHS2012', 'BIRTHS2013',
                   'BIRTHS2014', 'BIRTHS2015', 'POPESTIMATE2010', 'POPESTIMATE2011',
                   'POPESTIMATE2012', 'POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']
df = df[columns_to_keep]
df.head()
# Setting a multi-level index
df = df.set_index(['STNAME', 'CTYNAME'])
df.head()
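To see what the resulting index looks like without loading the census file, the sketch below applies the same set_index() call to a hypothetical three-row frame. The birth counts are invented for illustration.

```python
import pandas as pd

# Hypothetical figures, not taken from the census file
toy = pd.DataFrame({
    "STNAME": ["Michigan", "Michigan", "Ohio"],
    "CTYNAME": ["Washtenaw County", "Wayne County", "Franklin County"],
    "BIRTHS2010": [100, 200, 300],
})

# A list of columns produces one index level per column, in the order given
indexed = toy.set_index(["STNAME", "CTYNAME"])

print(indexed.index.nlevels)      # 2
print(list(indexed.index.names))  # ['STNAME', 'CTYNAME']
print(indexed.index[0])           # ('Michigan', 'Washtenaw County')
print(list(indexed.columns))      # ['BIRTHS2010']
```

Each row is now identified by a (state, county) tuple, which is what makes the composite-key comparison apt.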
Querying with Multi-Level Index
With a multi-level index, loc accepts one label per level, given in level order from outermost to innermost. Passing a list of tuples selects several rows at once.
# Querying data for Washtenaw County, Michigan
df.loc['Michigan', 'Washtenaw County']
# Comparing two counties: Washtenaw and Wayne County
df.loc[[('Michigan', 'Washtenaw County'), ('Michigan', 'Wayne County')]]
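The query forms above can be tried on a small hypothetical frame. This sketch uses invented birth counts and writes the single-row lookup as a tuple followed by a colon, which makes explicit that both labels refer to row levels rather than to a row and a column.

```python
import pandas as pd

# Hypothetical figures, not taken from the census file
toy = pd.DataFrame(
    {"BIRTHS2010": [100, 200, 300]},
    index=pd.MultiIndex.from_tuples(
        [
            ("Michigan", "Washtenaw County"),
            ("Michigan", "Wayne County"),
            ("Ohio", "Franklin County"),
        ],
        names=["STNAME", "CTYNAME"],
    ),
)

# One label per level selects a single row, returned as a Series
one_row = toy.loc[("Michigan", "Washtenaw County"), :]

# Only the outer level: every county in that state, returned as a DataFrame
one_state = toy.loc["Michigan"]

# A list of tuples selects several rows and keeps both index levels
two_rows = toy.loc[[("Michigan", "Washtenaw County"), ("Michigan", "Wayne County")]]

print(one_row["BIRTHS2010"])         # 100
print(list(one_state.index))         # ['Washtenaw County', 'Wayne County']
print(list(two_rows["BIRTHS2010"]))  # [100, 200]
```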
Hierarchical Labeling
Hierarchical labels are not limited to rows; columns can carry them too. Column levels group related columns under a shared outer label, which is useful when laying out data as a wide table.
Transposing Data with Hierarchical Column Labels
Transposing a DataFrame swaps its axes, so a multi-level row index becomes a set of multi-level column labels.
df.T  # Rows and columns swap; the (STNAME, CTYNAME) index becomes the column labels
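A minimal sketch of what the transposed frame looks like and how its hierarchical columns can be selected, using a hypothetical frame with invented birth counts:

```python
import pandas as pd

# Hypothetical figures, not taken from the census file
toy = pd.DataFrame(
    {"BIRTHS2010": [100, 200, 300]},
    index=pd.MultiIndex.from_tuples(
        [
            ("Michigan", "Washtenaw County"),
            ("Michigan", "Wayne County"),
            ("Ohio", "Franklin County"),
        ],
        names=["STNAME", "CTYNAME"],
    ),
)

# After transposing, the two index levels label the columns instead of the rows
transposed = toy.T

# A full tuple selects one column; the outer label alone selects a group of columns
wayne = transposed[("Michigan", "Wayne County")]
michigan = transposed["Michigan"]

print(transposed.columns.nlevels)  # 2
print(list(transposed.index))      # ['BIRTHS2010']
print(wayne["BIRTHS2010"])         # 200
print(list(michigan.columns))      # ['Washtenaw County', 'Wayne County']
```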